As AI model size grows, neural scaling laws have become a crucial tool for predicting the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models still improve, or will they be doomed to degenerate, up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with the number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.
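As a hedged, self-contained illustration of model collapse (a toy estimator, not the paper's transformer or Llama2 setup), repeatedly refitting a model on its own samples shows how fitted statistics drift once synthetic data dominates the training corpus:

```python
import numpy as np

# Toy illustration of model collapse (illustrative only, not the
# paper's framework): fit a Gaussian to data, then train the next
# "generation" solely on samples drawn from the previous fit.
# Estimation noise compounds across generations, so the fitted
# parameters drift from the original distribution; over long
# horizons the fitted spread collapses toward zero.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100)  # original "human" data

for gen in range(20):
    mu, sigma = data.mean(), data.std(ddof=1)    # "train" generation gen
    print(f"gen {gen:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")
    data = rng.normal(mu, sigma, size=100)       # fully synthetic corpus
```

Intuitively, mixing fresh human data back in at each generation anchors the fit and slows this drift, in line with the human/synthetic mixing regime the abstract analyzes.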
-
Hydrogels showing strong adhesion to different substrates have garnered significant attention for engineering applications. However, the development of such hydrogel-based adhesives is currently limited largely to synthetic polymers, owing to their exceptional performance and an extensive array of chemical options. To advance the development of sustainable hydrogel-based adhesives, we create a highly robust all-cellulose hydrogel-based adhesive, composed of concentrated dialcohol cellulose nanorods (DCNRs), that relies on enhanced hydrogen-bonding interactions between cellulose and the substrate. We achieve this high-performance all-cellulose hydrogel through a sequential oxidation-reduction process that converts the two secondary hydroxyl groups within an anhydroglucose unit into two primary hydroxyl groups while simultaneously linearizing the cellulose chains. As simulations indicate, these structural and chemical modifications increase out-of-plane interactions between the DCNR hydrogel and the substrate. They also enhance the flexibility of the cellulose chains, which would otherwise be rigid. The resulting all-cellulose hydrogels are injectable and adhere strongly to a wide range of substrates, including wood, metal, glass, and plastic. This green and sustainable all-cellulose hydrogel-based adhesive holds great promise for future bio-based adhesive design.
-
We consider information design in spatial resource competition, motivated by ride-sharing platforms sharing information with drivers about rider demand. Each of N co-located agents (drivers) decides whether to move to another location with an uncertain and possibly higher resource level (rider demand), where the utility for moving increases in the resource level and decreases in the number of other agents that move. A principal who can observe the resource level wishes to share this information in a way that ensures a welfare-maximizing number of agents move. Analyzing the principal's information design problem using the Bayesian persuasion framework, we study both private signaling mechanisms, where the principal sends personalized signals to each agent, and public signaling mechanisms, where the principal sends the same information to all agents. We show: 1) For private signaling, computing the optimal mechanism using the standard approach leads to a linear program with 2^N variables, rendering the computation challenging. We instead describe a computationally efficient two-step approach to finding the optimal private signaling mechanism. First, we perform a change of variables to solve a linear program with O(N^2) variables that provides the marginal probabilities of recommending that each agent move. Second, we describe an efficient sampling procedure over sets of agents consistent with these optimal marginal probabilities (sketched below); the optimal private mechanism then asks the sampled set of agents to move and the rest to stay. 2) For public signaling, we first show that the welfare-maximizing equilibrium given any common belief has a threshold structure. Using this, we show that the optimal public mechanism with respect to the sender-preferred equilibrium can be computed in polynomial time. 3) We support our analytical results with numerical computations showing that the optimal private and public signaling mechanisms achieve substantially higher social welfare than no-information and full-information benchmarks.
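For the second step of the private mechanism, one classical way to sample a set of agents whose inclusion probabilities match given marginals is systematic sampling. The sketch below assumes this scheme purely for illustration (the paper's own sampling procedure may differ), with hypothetical marginals p standing in for the output of the O(N^2) linear program:

```python
import numpy as np

def systematic_sample(p, rng):
    """Sample a set of agents whose inclusion marginals match p.

    Classical systematic sampling: each agent i is included with
    probability exactly p[i], and the set size is fixed whenever
    sum(p) is an integer. Illustrative stand-in for the paper's
    sampling procedure, which may differ.
    """
    c = np.cumsum(p)                      # cumulative marginals
    c0 = np.concatenate(([0.0], c[:-1]))  # interval left endpoints
    u = rng.uniform()                     # one shared random shift
    # agent i is selected iff a grid point u, u+1, u+2, ...
    # lands in the interval (c0[i], c[i]] of length p[i]
    return np.flatnonzero(np.floor(c - u) > np.floor(c0 - u))

rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.75, 0.5])      # hypothetical marginals (sum = 2)
print(systematic_sample(p, rng))          # always exactly 2 agents here
```

The sampled set is then asked to move and the rest to stay, realizing the optimal marginal recommendation probabilities exactly.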
-
Characterization of paste flow is important in ensuring rheological control during printing. The interactions between rheological characteristics and processing parameters are better studied through a combination of experimental and simulation tools. For fresh pastes and concrete, discrete element method (DEM)-based simulations are well suited to provide insights into the particle-scale processes occurring during extrusion-based printing and to relate them to the macro-scale response of the entire system. In this paper, we model the extrusion process of a plain ordinary Portland cement (OPC) paste using DEM and outline the methodology adopted to evaluate the linkage between particle-scale processes and the extrusion process. An analytical model for a frictional plastic material undergoing ram extrusion is also used in conjunction with the DEM model to arrive at the yield stresses and shaping stresses that enable an efficient extrusion process, as a function of the material microstructure.
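The abstract does not name its analytical ram-extrusion model; a common choice for frictional plastic pastes is the Benbow-Bridgwater relation, sketched below with illustrative (not paper-reported) parameter values:

```python
import math

def ram_extrusion_pressure(sigma0, alpha, tau0, beta, V, D0, D, L):
    """Benbow-Bridgwater ram-extrusion pressure (velocity-linear form).

    P = 2*(sigma0 + alpha*V)*ln(D0/D) + 4*(tau0 + beta*V)*(L/D)

    sigma0: bulk yield stress at the die entry [Pa]
    tau0:   wall shear yield stress in the die land [Pa]
    alpha, beta: velocity factors [Pa*s/m]; V: extrudate velocity [m/s]
    D0, D:  barrel and die diameters [m]; L: die-land length [m]
    """
    die_entry = 2.0 * (sigma0 + alpha * V) * math.log(D0 / D)
    die_land = 4.0 * (tau0 + beta * V) * (L / D)
    return die_entry + die_land

# Illustrative numbers only (not from the paper):
print(ram_extrusion_pressure(sigma0=20e3, alpha=1e5, tau0=5e3,
                             beta=5e4, V=0.01, D0=0.05, D=0.01, L=0.02))
```

The die-entry term captures the bulk yield (shaping) stress and the die-land term the wall friction, which is the split between yield stresses and shaping stresses the abstract refers to.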